Make lower and upper emit Utf8View for Utf8View input#20616
Make lower and upper emit Utf8View for Utf8View input#20616kumarUjjawal wants to merge 3 commits intoapache:mainfrom
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Thanks for looking at this!
The PR description has a typo (still talks about repeat).
Personally I find it a bit confusing to have the same / overlapping changes in concurrent PRs (e.g., for upper and lower).
| } | ||
|
|
||
| #[derive(Debug, Clone, Copy)] | ||
| enum Utf8ViewOutput { |
There was a problem hiding this comment.
Why do we need a new enum -- can't we just pass the return data type the caller expects?
| if arg_types[0] == DataType::Utf8View { | ||
| Ok(DataType::Utf8View) | ||
| } else { | ||
| utf8_to_str_type(&arg_types[0], "lower") |
There was a problem hiding this comment.
I wonder if it would be helpful to have a helper here that handles all three string representations and returns the optimal return type. Having a helper for Utf8 vs. LargeUtf8 but handling Utf8View at the call-site seems a bit odd.
| } else { | ||
| string_builder.append_null(); | ||
| match utf8view_output { | ||
| Utf8ViewOutput::Utf8 => { |
There was a problem hiding this comment.
I'm confused why we need to add code to handle the "Utf8View input, Utf8 output" case -- seems like we don't actually want that behavior. If we fixed up upper and lower as part of the same PR, couldn't we just have case_conversion return a value of the same type as its input? That would avoid this redundant code and also get rid of the utf8view_output parameter.
Thanks for the feedbak. I agree I should have done both lower and upper in the same pr, that makes much more sense. I'm working on it, will update the pr and close the other one. Initially I had thougth about it I don't remember why I didn't do it then. |
| } | ||
|
|
||
| #[test] | ||
| fn lower_return_type_dictionary_utf8view() -> Result<()> { |
There was a problem hiding this comment.
For symmetry, should we add an analogous test case for upper?
| } | ||
| } | ||
|
|
||
| pub(crate) fn case_conversion_return_type( |
There was a problem hiding this comment.
This doesn't seem specific to case conversion; we probably need a more generic facility for finding the return return type for a function that takes a string. But we can handle that later.
| name: &str, | ||
| ) -> Result<DataType> { | ||
| match arg_type { | ||
| DataType::Utf8View | DataType::BinaryView => Ok(DataType::Utf8View), |
There was a problem hiding this comment.
I believe BinaryView will be rejected by the type signatures for these functions, so probably clearer if we don't handle it.
| LargeUtf8 | ||
|
|
||
| query T | ||
| SELECT arrow_typeof(upper(arrow_cast('foo', 'Utf8View'))) |
There was a problem hiding this comment.
Add SLT tests for dictionary with a Utf8View value type?
Which issue does this PR close?
Rationale for this change
String UDFs should preserve string representation where feasible. lower and upper previously accepted Utf8View input but emitted Utf8, causing an unnecessary type downgrade. This aligns both with the expected behavior of returning the same string type as its primary input.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?